MAIN DATA SET
CollegeDataSet <- college %>%
select(INSTNM,REGION,IND_INC_AVG,GRAD_DEBT_MDN,CIP14BACHL,CIP54BACHL,C150_4,TUITIONFEE_IN,TUITIONFEE_OUT,C100_4_POOLED,RELAFFIL)
CollegeDataSet <- college %>%
select(INSTNM,REGION,IND_INC_AVG,GRAD_DEBT_MDN,CIP14BACHL,CIP54BACHL,C150_4,TUITIONFEE_IN,TUITIONFEE_OUT,C100_4_POOLED,RELAFFIL)
data1 <- CollegeDataSet %>%
select(GRAD_DEBT_MDN, REGION)
data2 <- data1 %>%
filter(REGION == 5 || REGION == 6) %>%
mutate(
REGION = recode(
REGION,
`5` = "se",
`6` = "sw"))
## Warning: Unreplaced values treated as NA as .x is not compatible. Please
## specify replacements exhaustively or supply .default
data3 <- data2 %>%
na.omit(cols = GRAD_DEBT_MDN, invert = FALSE)
college_data_tran1 <- CollegeDataSet %>%
select (CIP14BACHL,CIP54BACHL,C150_4)
Student: Fahd Saleem (G01154081)
Question: What is the disparity between various religious affiliations of institutions and the average family income of independent students compared to those of non-religious institutions. Does the religious affiliation of an institution imply students being of a higher socioeconomic class?
Variables: IND_INC_AVG, RELAFFIL
Preview: My question is in regards to the various university’s and their religious affiliation, I will be first showing the various religious institutions and how they vary in terms of an independent student’s average family income, I will also be showing the disparity between non religious institutions and religious institutions in terms of their average family income. This will be done through conducting a summary statistic, cleaning the data in attempts to facilitate the users’ view of the data in an interactive visualization, and lastly conduct a hypothesis test in which the question of whether or not there is a difference or disparity between the average family incomes will be answered!
The following portion satisfies the summary statistic requirement.
The code below serves the purpose of getting the summary statistics for non-religious institutions. Firstly as shown the dataset used is the college dataset which and the output of the following code is set to hasnoreligion in order to store the output, which will be used later on. Moving on, a filter is used to get the schools that have a religious affiliation of NA which is essentially all the institutions who have don’t have a religious affiliation. Moreover, a summarise function has been used in order to get the summary statistics such as the average, minimum, and the maximum for all institutions which are of a non-religious affiliation. In the output, we can see that the minimum value is seen to be zero; this could mean that this data might not be reliable, possibly due to a wrong entry in the original set. The summary statistics are used for the variable IND_INC_AVG, which is the independent income average for independent students. Overall this goes to show the statistics for non-religious institutions as labeled within the output.
##SUMMARY FOR INSTITUTIONS OF NON RELIGIOUS AFFILIATION
hasnoreli<-CollegeDataSet%>%
filter(is.na(RELAFFIL))%>%
summarise(
RELAFFIL = "Non-Religious Institutions",
avg = round(mean(IND_INC_AVG,na.rm = TRUE)),
min = min(IND_INC_AVG,na.rm = TRUE),
max = max(IND_INC_AVG,na.rm = TRUE),
med = median(IND_INC_AVG,na.rm = TRUE),
SD = sd(IND_INC_AVG,na.rm = TRUE),
IQR = IQR(IND_INC_AVG,na.rm = TRUE)
)
head(hasnoreli)
| RELAFFIL | avg | min | max | med | SD | IQR |
|---|---|---|---|---|---|---|
| Non-Religious Institutions | 20275 | 0 | 67438.15 | 19466.64 | 7549.05 | 8571.052 |
hasreli, meaning that the institutions have a religious affiliation. Next, we use a filter function, the filter function is checking for values which do have a valid numeric value, necessarily applicable values. The not symbol ! is used to ensure we are filtering for values that are not NA or null for the column RELAAFFIL. Next, a summarise function is used to get the summary statistics for religious institutions. The statistics consist of the average, minimum, and maximum for the independent average income of students who attend religious institutions.##SUMMARY FOR INSTITUTIONS OF RELIGIOUS AFFILIATION
hasreli<- CollegeDataSet%>%
filter(!is.na(RELAFFIL))%>%
summarise(
RELAFFIL = "Religious Institutions",
avg = round(mean(IND_INC_AVG,na.rm = TRUE)),
min = min(IND_INC_AVG,na.rm = TRUE),
max = max(IND_INC_AVG,na.rm = TRUE),
med = median(IND_INC_AVG,na.rm = TRUE),
SD = sd(IND_INC_AVG,na.rm = TRUE),
IQR = IQR(IND_INC_AVG,na.rm = TRUE)
)
head(hasreli)
| RELAFFIL | avg | min | max | med | SD | IQR |
|---|---|---|---|---|---|---|
| Religious Institutions | 25824 | 557.6104 | 55639.34 | 26464.04 | 11532.98 | 17437.88 |
RELAFFIL or religious affiliations to be filtered so that only religious affiliations that are not NA are shown. I did this because RELAFFIL has its own specific value for null values, which corresponds to a -2 or -1; therefor any values which are in actuality null are considered non-religious affiliations. As a result, they are filtered out since I want to get only the institutions of religious affiliations. Next, I have grouped them by RELAFFIL, allowing for each individual type of religious affiliation of an institution to be shown. If you see the output it is prominent that all the values under the RELAFFIL column are numerical values, how will the user possibly know which religious affiliation is which? I took this into significant consideration, as shown in within upcoming code. In some forthcoming code, I will recode all the values to make it possible for viewers of the data to see the actual religious affiliation as a string rather than an uninterpretable numerical value. Back to this code chunk, I go on to conduct a summary statistic the same as before; lastly, the code is filtered of non-applicable values just as a precaution. Within the output, all individual institutions of religious affiliation are shown and its corresponding summary statistic.##SUMMARY FOR EACH INDIVIDUAL RELIGIOUS AFFILIATION INSTITUTION
summary <- CollegeDataSet%>%
filter(!is.na(RELAFFIL))%>%
group_by(RELAFFIL) %>%
summarise(
avg = round(mean(IND_INC_AVG, na.rm = TRUE)),
min = min(IND_INC_AVG, na.rm = TRUE),
max = max(IND_INC_AVG, na.rm = TRUE),
med = median(IND_INC_AVG,na.rm = TRUE),
SD = sd(IND_INC_AVG,na.rm = TRUE),
IQR = IQR(IND_INC_AVG,na.rm = TRUE)
)%>%
filter(!is.na(avg))
head(summary)
| RELAFFIL | avg | min | max | med | SD | IQR |
|---|---|---|---|---|---|---|
| 22 | 28170 | 28170.111 | 28170.11 | 28170.111 | NA | 0.000 |
| 24 | 9987 | 7244.028 | 12730.69 | 9987.359 | 3879.657 | 2743.332 |
| 27 | 24151 | 13531.000 | 32716.35 | 26456.557 | 6712.750 | 9718.246 |
| 28 | 25369 | 10408.224 | 40329.69 | 25368.957 | 21157.671 | 14960.732 |
| 30 | 29045 | 4820.895 | 54766.45 | 30081.574 | 10567.190 | 11919.131 |
| 33 | 28901 | 13794.636 | 44006.67 | 28900.653 | 21363.133 | 15106.017 |
True and False are used. Originally without the mutate, we are shown NA values for the institutions with no religious affiliation. The recode recodes the non-religious with False. Next, the remaining default is replaced with True.##SUMMARY FOR ALL SCHOOLS AND THEIR AFFILIATIONS
summaryall <- CollegeDataSet%>%
group_by(INSTNM,RELAFFIL) %>%
summarise(
avg = round(mean(IND_INC_AVG, na.rm = TRUE)),
min = min(IND_INC_AVG, na.rm = TRUE),
max = max(IND_INC_AVG, na.rm = TRUE),
med = median(IND_INC_AVG,na.rm = TRUE),
SD = sd(IND_INC_AVG,na.rm = TRUE),
IQR = IQR(IND_INC_AVG,na.rm = TRUE)
)%>%
mutate(
RELAFFIL = recode(
RELAFFIL,
.missing = "False",
.default = "True"
))%>%
filter(!is.na(avg))
head(summaryall)
| INSTNM | RELAFFIL | avg | min | max | med | SD | IQR |
|---|---|---|---|---|---|---|---|
| Aaniiih Nakoda College | False | 9170 | 9169.706 | 9169.706 | 9169.706 | NA | 0 |
| Abcott Institute | False | 10770 | 10770.245 | 10770.245 | 10770.245 | NA | 0 |
| Abdill Career College Inc | False | 16590 | 16589.949 | 16589.949 | 16589.949 | NA | 0 |
| Abilene Christian University | True | 18267 | 18267.444 | 18267.444 | 18267.444 | NA | 0 |
| Abraham Baldwin Agricultural College | False | 25314 | 25313.870 | 25313.870 | 25313.870 | NA | 0 |
| Academia Serrant Inc | False | 4524 | 4523.507 | 4523.507 | 4523.507 | NA | 0 |
RELAFFIL, which represents the various religious affiliations. This is necessary to do since the proceeding code will incorporate an interactive visual, and if the user is looking at the visual, all the numerical religious affiliations won’t make any sense unless a very long key is provided. However, I did not want to make the user experience bad and went through the excruciating process to recode all the values for religious affiliations so that the end-user could actually know what data they are looking at. This essentially allows for the values which were originally numeric to be replaced with the string value I have assigned it. Overall this was intended to immensely facilitate the end-user experience & allows the user viewing my visualization to know which religious affiliation corresponds with what average income. I’ve included head(updat_summary), which will enable you to see how each religious affiliation now has a name associated with it. Going back to the code head(summary), you can see RELAFFIL has only numeric values associated with it, making it almost impossible to know what religious affiliation is what.updat_summary<-summary%>%
mutate(
RELAFFIL = recode(
RELAFFIL,
.default = "Not reported",
`-1`= "Not reported",
`-2`= "Not applicable",
`22`= "American Evangelical Lutheran Church",
`24`= "African Methodist Episcopal Zion Church",
`27`= "Assemblies of God Church",
`28`= "Brethren Church",
`30`= "Roman Catholic",
`33`= "Wisconsin Evangelical Lutheran Synod",
`34`= "Christ and Missionary Alliance Church",
`35`= "Christian Reformed Church",
`36`= "Evangelical Congregational Church",
`37`= "Evangelical Covenant Church of America",
`38`= "Evangelical Free Church of America",
`39`= "Evangelical Lutheran Church",
`40`= "International United Pentecostal Church",
`41`= "Free Will Baptist Church",
`42`= "Interdenominational",
`43`= "Mennonite Brethren Church",
`44`= "Moravian Church",
`45`= "North American Baptist",
`47`= "Pentecostal Holiness Church",
`48`= "Christian Churches and Churches of Christ",
`49`= "Reformed Church in America",
`50`= "Episcopal Church, Reformed",
`51`= "African Methodist Episcopal",
`52`= "American Baptist",
`53`= "American Lutheran",
`54`= "Baptist",
`55`= "Christian Methodist Episcopal",
`57`= "Church of God",
`58`= "Church of Brethren",
`59`= "Church of the Nazarene",
`60`= "Cumberland Presbyterian",
`61`= "Christian Church (Disciples of Christ)",
`64`= "Free Methodist",
`65`= "Friends",
`66`= "Presbyterian Church (USA)",
`67`= "Lutheran Church in America",
`68`= "Lutheran Church - Missouri Synod",
`69`= "Mennonite Church",
`71`= "United Methodist",
`73`= "Protestant Episcopal",
`74`= "Churches of Christ",
`75`= "Southern Baptist",
`76`= "United Church of Christ",
`77`= "Protestant, not specified",
`78`= "Multiple Protestant Denomination",
`79`= "Other Protestant",
`80`= "Jewish",
`81`= "Reformed Presbyterian Church",
`84`= "United Brethren Church",
`87`= "Missionary Church Inc",
`88`= "Undenominational",
`89`= "Wesleyan",
`91`= "Greek Orthodox",
`92`= "Russian Orthodox",
`93`= "Unitarian Universalist",
`94`= "Latter Day Saints (Mormon Church)",
`95`= "Seventh Day Adventists",
`97`= "The Presbyterian Church in America",
`99`= "Other (none of the above)",
`100`= "Original Free Will Baptist",
`101`= "Ecumenical Christian",
`102`= "Evangelical Christian",
`103`= "Presbyterian",
`105`= "General Baptist",
`106`= "Muslim",
`107`= "Plymouth Brethren"
)
)
head(updat_summary)
| RELAFFIL | avg | min | max | med | SD | IQR |
|---|---|---|---|---|---|---|
| American Evangelical Lutheran Church | 28170 | 28170.111 | 28170.11 | 28170.111 | NA | 0.000 |
| African Methodist Episcopal Zion Church | 9987 | 7244.028 | 12730.69 | 9987.359 | 3879.657 | 2743.332 |
| Assemblies of God Church | 24151 | 13531.000 | 32716.35 | 26456.557 | 6712.750 | 9718.246 |
| Brethren Church | 25369 | 10408.224 | 40329.69 | 25368.957 | 21157.671 | 14960.732 |
| Roman Catholic | 29045 | 4820.895 | 54766.45 | 30081.574 | 10567.190 | 11919.131 |
| Wisconsin Evangelical Lutheran Synod | 28901 | 13794.636 | 44006.67 | 28900.653 | 21363.133 | 15106.017 |
The following portion satisfies the data visualization requirement.
The following code binds the updat_summary, hasreli, and hasnoreli into a new entity called summary_all. This data will be further used in a visualization to allow the user to compare the various averages for different religious institutions, and make comparisons with the overall averages for both the non-religious institutions as well as the overall average for religious institutions.
summary_all <- rbind(updat_summary,hasreli, hasnoreli)
summary_all%>%
hchart(
"column",
hcaes(
x = RELAFFIL,
y = avg
)
) %>%
hc_tooltip(
useHTML = TRUE,
pointFormat = tooltip_table(
x = combine("Religious Affiliation:", "Average Family income of independent students: $"),
y = combine("{point.RELAFFIL}", "{point.avg}")
)
)
## Warning: `parse_quosure()` is deprecated as of rlang 0.2.0.
## Please use `parse_quo()` instead.
## This warning is displayed once per session.
hasreli and the hasnoreli table contents together. The purpose of this code is used for the code below, which shows the difference between religious and non-religious institutions in terms of the average income of independent students.summary_both <- rbind(hasreli, hasnoreli)
summary_both%>%
hchart(
"column",
hcaes(
x = RELAFFIL,
y = avg
)
) %>%
hc_tooltip(
useHTML = TRUE,
pointFormat = tooltip_table(
x = combine("Religious Affiliation:", "Average Family income of independent students: $"),
y = combine("{point.RELAFFIL}", "{point.avg}")
)
)
- From this density plot, it is prominent that the highest density point is around 20,000 for institutions of no religious affiliation or false. Looking back at the interactive visualization, it matches up confirming our analysis. Next, looking at the blue-colored visual, it is prominent that the average is around 25,000 for independent individual income. This can be confirmed by looking at our interactive visualization which is similarly at $25,824 for institutions of religious affiliation.
summaryall%>%
ggplot() +
geom_density(aes(x = avg, fill = RELAFFIL),alpha = 0.5, position = "identity")+
facet_wrap(~ RELAFFIL, scales = "free_x")
#Im specifying only accounting for 4 year students at 100% completion time.
college_4_year_completion <- college %>%
select(INSTNM,TUITIONFEE_IN,TUITIONFEE_OUT,C100_4_POOLED,REGION) %>%
mutate(
REGION = recode(
REGION,
"0" = "U.S. Service Schools",
"1" = "New England Schools",
"2" = "Mid Eastern Schools",
"3" = "Great Lakes Schools",
"4" = "Great Plains Schools",
"5" = "Southeastern Schools",
"6" = "Southwestern Schools",
"7" = "Rocky Mountains Schools",
"8" = "Far Western Schools",
"9" = "Outlying Areas Schools"
))
college_4_year_completion$TUITIONFEE_IN <- as.numeric(college_4_year_completion$TUITIONFEE_IN)
college_4_year_completion$TUITIONFEE_OUT <- as.numeric(college_4_year_completion$TUITIONFEE_OUT)
college_4_year_completion$C100_4_POOLED <- as.numeric(college_4_year_completion$C100_4_POOLED)
Categories and where they are from:
0 U.S. Service Schools 1 New England (CT, ME, MA, NH, RI, VT) 2 Mid East (DE, DC, MD, NJ, NY, PA) 3 Great Lakes (IL, IN, MI, OH, WI) 4 Great Plains (IA, KS, MN, MO, NE, ND, SD) 5 Southeast (AL, AR, FL, GA, KY, LA, MS, NC, SC, TN, VA, WV) 6 Southwest (AZ, NM, OK, TX) 7 Rocky Mountains (CO, ID, MT, UT, WY) 8 Far West (AK, CA, HI, NV, OR, WA) 9 Outlying Areas (AS, FM, GU, MH, MP, PR, PW, VI)
For this project, I will be answering:
The variables that I will be using:
CIP14BACHL –> Bachelor’s degree in Engineering.
CIP54BACHL –> Bachelor’s degree in History.
C150_4 –> Completion rate for first-time, full-time students at four-year institutions (150% of expected time to completion)
college_data_tran1 %>%
gather(CIP14BACHL:CIP54BACHL, key = "variable", value = "value") %>%
ggplot() +
geom_histogram(mapping = aes(x = C150_4), position = "identity",
bins = 50, alpha = 0.5) +
facet_wrap(. ~ variable)
## Warning: Removed 9406 rows containing non-finite values (stat_bin).
college_data_tran1 %>%
ggplot() +
geom_boxplot(
mapping = aes(x = C150_4, y = CIP14BACHL))
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
## Warning: Removed 4703 rows containing missing values (stat_boxplot).
college_data_tran1 %>%
ggplot() +
geom_boxplot(
mapping = aes(x = C150_4, y = CIP54BACHL))
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
## Warning: Removed 4703 rows containing missing values (stat_boxplot).
I picked to explore the variation in my data using the histogram. I picked the histogram because it’s a good representation of the data, it gives me an overall view of what to expect. For this graph, it is showing the graduation’s rate of colleges for the 2 majors that I have picked (Enginerring and History). The distribution of the graph seems to be normal, with a bell-shaped. It does not appear that there are differences between the graduation’s rate within colleges where students are either working toward a bachelor in engineering or history. The box plots also show us that there are no outliers.
college_data_tran1 %>%
summarize(
mean = mean(CIP14BACHL, na.rm = TRUE),
median = median(CIP14BACHL, na.rm = TRUE),
min = min(CIP14BACHL, na.rm = TRUE),
max = max(CIP14BACHL, na.rm = TRUE),
sd = sd(CIP14BACHL, na.rm = TRUE),
iqr = IQR(CIP14BACHL, na.rm = TRUE)
)
| mean | median | min | max | sd | iqr |
|---|---|---|---|---|---|
| 0.0983805 | 0 | 0 | 2 | 0.3112709 | 0 |
college_data_tran1 %>%
summarize(
mean = mean(CIP54BACHL, na.rm = TRUE),
median = median(CIP54BACHL, na.rm = TRUE),
min = min(CIP54BACHL, na.rm = TRUE),
max = max(CIP54BACHL, na.rm = TRUE),
sd = sd(CIP54BACHL, na.rm = TRUE),
iqr = IQR(CIP54BACHL, na.rm = TRUE)
)
| mean | median | min | max | sd | iqr |
|---|---|---|---|---|---|
| 0.2044801 | 0 | 0 | 2 | 0.4256297 | 0 |
The mean for Engineering is a lot less compared to the mean for History. Both have about the same statistics for other categories.
college_data_model <- lm(CIP14BACHL ~ CIP54BACHL, data = college_data_tran1)
college_data_model_df <- college_data_tran1 %>%
add_predictions(college_data_model) %>%
add_residuals(college_data_model)
college_data_model_df %>%
gather(CIP14BACHL:CIP54BACHL, key = "variable", value = "value") %>%
ggplot() +
geom_point(aes(variable, value))
## Warning: Removed 902 rows containing missing values (geom_point).
college_data_model %>%
tidy()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.0204519 | 0.0036263 | 5.639868 | 0 |
| CIP54BACHL | 0.3811063 | 0.0076801 | 49.622822 | 0 |
college_data_model %>%
glance()
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.2715682 | 0.2714579 | 0.2656843 | 2462.424 | 0 | 2 | -616.7015 | 1239.403 | 1259.791 | 466.2347 | 6605 |
In conclusion, it does not seem like there is a relationship between majors (Engineering and History) at colleges and the graduation’s rate.
data3_plot <- ggplot(data2) +
geom_histogram(mapping = aes(x = GRAD_DEBT_MDN), binwidth = 5000) +
facet_wrap(~ REGION)
The first histogram shows that the most frequent debt that students in schools have in the southeastern region appear in the 10,000s. It also indicates that there are more schools compared to the southwestern region. In the southwestern region, the histogram shows that the debt is more frequent in the 20’000s, with less counts meaning less schools..
data3_stats <- data3 %>%
group_by(REGION) %>%
summarize(
sum = sum(GRAD_DEBT_MDN),
mean = mean(GRAD_DEBT_MDN),
min = min(GRAD_DEBT_MDN),
max = max(GRAD_DEBT_MDN)
)
The southeast region has schools that on average have more student debt after graduation then southwestern schools. They also have more overall debt. ### linear model
data3_model <- lm(GRAD_DEBT_MDN ~ REGION, data = data3)
data3_model %>%
glance()
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.0232559 | 0.0227522 | 8718.901 | 46.16685 | 0 | 2 | -20364.33 | 40734.67 | 40751.38 | 147401292561 | 1939 |
data3_model %>%
tidy()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 17274.788 | 235.8183 | 73.254657 | 0 |
| REGIONsw | -2946.452 | 433.6450 | -6.794619 | 0 |
The r squared value of this model is close to zero, indicating that there is nearly no correlation between GRAD_DEBT_MDN and region. The P-value is very small, indicating that there is a significan relationship between the two variables.
Question: Between the two tuition rates, in-state and out-of-state, does the graduation rate change based on the tuition rates of colleges, and do higher tuition rate colleges have a lower graduation rate?
For my data analysis, I am using the variables TUITIONFEE_IN, TUITIONFEE_OUT, and C100_4_POOLED which is, respectively, the tution rate of instate students, tuition rate of out of state students, and the completion rate of that college’s students based at 100% completion time, which is 4 years / 8 semesters. Since were not taking into account the students that pay out of state vs the students that pay in-state, we can assume that the graduation rate would remain the same for in-state and out-of-state tuitions in this case.
# Summary Stats of all schools:
summary_in <- college_4_year_completion %>%
summarise(
Avg_Cost_InState = mean(TUITIONFEE_IN, na.rm = TRUE),
Min_Cost_InState = min(TUITIONFEE_IN, na.rm = TRUE),
Max_Cost_InState = max(TUITIONFEE_IN, na.rm = TRUE),
Std_Cost_InState = sd(TUITIONFEE_IN, na.rm = TRUE),
med_Cost_InState = median(TUITIONFEE_IN,na.rm = TRUE),
IQR_InState = IQR(TUITIONFEE_IN,na.rm = TRUE)
)
head(summary_in)
| Avg_Cost_InState | Min_Cost_InState | Max_Cost_InState | Std_Cost_InState | med_Cost_InState | IQR_InState |
|---|---|---|---|---|---|
| 15120.68 | 0 | 74514 | 12730.86 | 11375 | 14166 |
# Summary Stats of all schools:
summary_out <- college_4_year_completion %>%
summarise(
Avg_Cost_OutOfState = mean(TUITIONFEE_IN, na.rm = TRUE),
Min_Cost_OutOfState = min(TUITIONFEE_IN, na.rm = TRUE),
Max_Cost_OutOfState = max(TUITIONFEE_IN, na.rm = TRUE),
Std_Cost_OutOfState = sd(TUITIONFEE_IN, na.rm = TRUE),
med_Cost_OutState = median(TUITIONFEE_OUT,na.rm = TRUE),
IQR_outState = IQR(TUITIONFEE_OUT,na.rm = TRUE)
)
head(summary_out)
| Avg_Cost_OutOfState | Min_Cost_OutOfState | Max_Cost_OutOfState | Std_Cost_OutOfState | med_Cost_OutState | IQR_outState |
|---|---|---|---|---|---|
| 15120.68 | 0 | 74514 | 12730.86 | 15000 | 15900 |
# Summary Stats by Region of Instate Tuition Costs.
summary_in_region <- college_4_year_completion %>%
group_by(REGION) %>%
summarise(
Avg_Cost_InState = mean(TUITIONFEE_IN, na.rm = TRUE),
Min_Cost_InState = min(TUITIONFEE_IN, na.rm = TRUE),
Max_Cost_InState = max(TUITIONFEE_IN, na.rm = TRUE),
Std_Cost_InState = sd(TUITIONFEE_IN, na.rm = TRUE),
med_Cost_InState = median(TUITIONFEE_IN,na.rm = TRUE),
IQR_InState = IQR(TUITIONFEE_IN,na.rm = TRUE)
)
summary_in_region
| REGION | Avg_Cost_InState | Min_Cost_InState | Max_Cost_InState | Std_Cost_InState | med_Cost_InState | IQR_InState |
|---|---|---|---|---|---|---|
| Far Western Schools | 14359.07 | 1104 | 54636 | 14634.406 | 8977.0 | 17255.00 |
| Great Lakes Schools | 16177.42 | 2000 | 56034 | 12161.502 | 12928.5 | 17901.00 |
| Great Plains Schools | 14292.11 | 480 | 52782 | 11061.220 | 10935.0 | 14232.75 |
| Mid Eastern Schools | 19537.28 | 2675 | 57208 | 14171.316 | 13944.0 | 20281.00 |
| New England Schools | 24583.56 | 3570 | 54770 | 16629.018 | 19156.5 | 28838.75 |
| Outlying Areas Schools | 6148.63 | 1040 | 13036 | 2041.274 | 6472.5 | 2157.00 |
| Rocky Mountains Schools | 10804.64 | 1835 | 52818 | 9175.745 | 7488.0 | 11092.00 |
| Southeastern Schools | 13130.93 | 1350 | 74514 | 10249.697 | 11075.0 | 12417.75 |
| Southwestern Schools | 10917.10 | 932 | 52498 | 9966.350 | 8000.0 | 12262.50 |
| U.S. Service Schools | 255.00 | 0 | 1020 | 510.000 | 0.0 | 255.00 |
# Summary Stats by Region of Out of State Tuition Costs.
summary_out_region <- college_4_year_completion %>%
group_by(REGION) %>%
summarise(
Avg_Cost_OutState = mean(TUITIONFEE_OUT, na.rm = TRUE),
Min_Cost_OutState = min(TUITIONFEE_OUT, na.rm = TRUE),
Max_Cost_OutState = max(TUITIONFEE_OUT, na.rm = TRUE),
Std_Cost_OutState = sd(TUITIONFEE_OUT, na.rm = TRUE),
med_Cost_OutState = median(TUITIONFEE_OUT,na.rm = TRUE),
IQR_outState = IQR(TUITIONFEE_OUT,na.rm = TRUE)
)
summary_out_region
| REGION | Avg_Cost_OutState | Min_Cost_OutState | Max_Cost_OutState | Std_Cost_OutState | med_Cost_OutState | IQR_outState |
|---|---|---|---|---|---|---|
| Far Western Schools | 18352.794 | 2840 | 54636 | 13444.171 | 13438 | 18115.00 |
| Great Lakes Schools | 19442.041 | 2000 | 56034 | 11237.597 | 16230 | 16488.00 |
| Great Plains Schools | 16095.760 | 480 | 52782 | 10891.502 | 14263 | 16537.75 |
| Mid Eastern Schools | 21974.130 | 2675 | 57208 | 13083.600 | 17648 | 18956.00 |
| New England Schools | 28584.000 | 6306 | 54770 | 14415.203 | 28656 | 23241.50 |
| Outlying Areas Schools | 6630.532 | 1040 | 15022 | 2608.042 | 6480 | 2056.50 |
| Rocky Mountains Schools | 15710.644 | 2250 | 52818 | 8879.471 | 14673 | 10008.50 |
| Southeastern Schools | 16523.072 | 1350 | 74514 | 9747.669 | 14290 | 11348.00 |
| Southwestern Schools | 14480.969 | 932 | 52498 | 9650.328 | 13438 | 11910.00 |
| U.S. Service Schools | 255.000 | 0 | 1020 | 510.000 | 0 | 255.00 |
# Graduation Rate by Region
summary_grad_rate_region <- college_4_year_completion %>%
group_by(REGION) %>%
summarise(
Avg_Grad_rate = mean(C100_4_POOLED, na.rm = TRUE),
Std_Grad_rate = sd(C100_4_POOLED, na.rm = TRUE)
)
summary_grad_rate_region
| REGION | Avg_Grad_rate | Std_Grad_rate |
|---|---|---|
| Far Western Schools | 0.3589134 | 0.2422179 |
| Great Lakes Schools | 0.3489524 | 0.2212757 |
| Great Plains Schools | 0.3428751 | 0.2054895 |
| Mid Eastern Schools | 0.3925379 | 0.2477942 |
| New England Schools | 0.4927770 | 0.2364143 |
| Outlying Areas Schools | 0.1171596 | 0.1367742 |
| Rocky Mountains Schools | 0.2774934 | 0.1995911 |
| Southeastern Schools | 0.2971091 | 0.1961184 |
| Southwestern Schools | 0.2592190 | 0.2013747 |
| U.S. Service Schools | 0.7963250 | 0.0812408 |
summary_grad_rate_region %>%
hchart(
"column",
hcaes(
x = REGION,
y = Avg_Grad_rate
)
) %>%
hc_tooltip(
useHTML = TRUE,
pointFormat = tooltip_table(
x = combine("REGION/CLASSIFICATION:", "Average Graduation Rate:"),
y = combine("{point.REGION}", "{point.Avg_Grad_rate}")
)
)
It is safe to say that we should remove the U.S. Service Schools since they do not charge tuition apart from Merchant Marine Academy; U.S. Service Schools have the highest average graduation rate of 79.63%. This could possibly cause some major outlying in the hypothesis testing. We will do a linear regression with service schools, and without and compare the results.
without_service <- college_4_year_completion %>%
filter(REGION != "U.S. Service Schools")
with_service <- college_4_year_completion
Linear Model
without_service_model <- lm(TUITIONFEE_IN ~ C100_4_POOLED,data=without_service)
with_service_model <- lm(TUITIONFEE_IN ~ C100_4_POOLED, data=with_service)
without_service_model %>%
glance()
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.4982165 | 0.4979771 | 9674.265 | 2081.1 | 0 | 2 | -22229.75 | 44465.5 | 44482.45 | 196167589885 | 2096 |
without_service_model %>%
tidy()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 6677.646 | 382.7839 | 17.44495 | 0 |
| C100_4_POOLED | 42395.900 | 929.3459 | 45.61907 | 0 |
with_service_model %>%
glance() %>%
as_data_frame()
## Warning: `as_data_frame()` is deprecated, use `as_tibble()` (but mind the new semantics).
## This warning is displayed once per session.
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.4840636 | 0.483818 | 9822.425 | 1970.27 | 0 | 2 | -22304.08 | 44614.16 | 44631.12 | 202608075568 | 2100 |
with_service_model %>%
tidy() %>%
as_data_frame()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 6833.177 | 388.1693 | 17.60360 | 0 |
| C100_4_POOLED | 41722.214 | 939.9495 | 44.38772 | 0 |
without_service_model %>%
glance() %>%
as_data_frame()
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.4982165 | 0.4979771 | 9674.265 | 2081.1 | 0 | 2 | -22229.75 | 44465.5 | 44482.45 | 196167589885 | 2096 |
without_service_model %>%
tidy() %>%
as_data_frame()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 6677.646 | 382.7839 | 17.44495 | 0 |
| C100_4_POOLED | 42395.900 | 929.3459 | 45.61907 | 0 |
with_service_df <- with_service %>%
add_residuals(with_service_model) %>%
add_predictions(with_service_model)
without_service_df <- without_service %>%
add_residuals(without_service_model) %>%
add_predictions(without_service_model)
with_service_df %>%
ggplot() +
geom_point(aes(x = C100_4_POOLED, y = TUITIONFEE_IN)) +
geom_line(aes(x = C100_4_POOLED, y = pred), color="red",size=1)
## Warning: Removed 4956 rows containing missing values (geom_point).
## Warning: Removed 4911 rows containing missing values (geom_path).
without_service_df %>%
ggplot() +
geom_point(aes(x = C100_4_POOLED, y = TUITIONFEE_IN)) +
geom_line(aes(x = C100_4_POOLED, y = pred), color="red", size=1)
## Warning: Removed 4954 rows containing missing values (geom_point).
## Warning: Removed 4909 rows containing missing values (geom_path).
with_service_df %>%
ggplot(mapping = aes(x = resid)) +
geom_histogram(binwidth = 100)
## Warning: Removed 4956 rows containing non-finite values (stat_bin).
without_service_df %>%
ggplot(mapping = aes(x = resid)) +
geom_histogram(binwidth = 100)
## Warning: Removed 4954 rows containing non-finite values (stat_bin).
Comparing the two data frames, one with service schools and one without, we can see that there really is no reason to remove the service schools, except for some minor improvements in the r^2 value.
In conclusion, there really isnt much of a correlation in tuition and graduation rate. Colleges seem to be all over the place, and the r^2 value of ~0.49 is not high enough to really say there is a correlation between tuition rates and graduation rates.
The following portion satisfies hypothesis testing requirement.
Needed to create the simulated null distribution, in order to quantify how likely it is that a random chance model would generate the experimental result.
summaryall_test <- summaryall %>%
specify(formula = avg ~ RELAFFIL) %>%
hypothesize(null = "independence") %>%
generate(reps = 10000, type = "permute") %>%
calculate(stat = "diff in means", order = combine("True", "False"))
head(summaryall_test)
| replicate | stat |
|---|---|
| 1 | -12.99704 |
| 2 | -18.50042 |
| 3 | 575.22292 |
| 4 | 299.64724 |
| 5 | -96.07602 |
| 6 | 49.64035 |
summaryall_test_stat <- summaryall%>%
specify(formula = avg ~ RELAFFIL)%>%
calculate(stat = "diff in means", order = combine("True", "False"))
head(summaryall_test_stat)
| stat |
|---|
| 5484.92 |
summaryall_test%>%
get_p_value(obs_stat = summaryall_test_stat, direction = "both")
| p_value |
|---|
| 0 |
summaryall_test%>%
visualize() +
shade_p_value(obs_stat = summaryall_test_stat, direction = "both")
bootstrap_summaryall <- summaryall%>%
specify(formula = avg ~ RELAFFIL)%>%
generate(reps = 10000, type = "bootstrap") %>%
calculate(stat = "diff in means", order = combine("True", "False"))
head(bootstrap_summaryall)
| replicate | stat |
|---|---|
| 1 | 5509.379 |
| 2 | 5687.259 |
| 3 | 4793.777 |
| 4 | 6111.076 |
| 5 | 5863.339 |
| 6 | 5185.511 |
summaryall_ci <- bootstrap_summaryall %>%
get_confidence_interval()
bootstrap_summaryall %>%
visualise() + shade_confidence_interval(summaryall_ci)
RELAFFIL and IND_INC_AVG. Then the summary statistics portion was conducted. The overall summary statistics for religious and non-religious institutions were performed. Moreover, the summary statistics for each individual religious affiliation was also conducted to be used in an interactive visualization for the end-user to view and compare. In addition to the full visualization, a more specific visual was created, showing just the two bars for the religious versus non-religious institutions. This visual revealed that there, in fact, is a difference between both and that those religious institutions did have students of a higher average income. Moreover, a density plot was made to see the density of the various averages; when compared to the prior visualization, the average incomes lined up with their corresponding religious affiliation, either false or true. Lastly, the hypothesis testing portion was conducted, in which it was found that the probability of accepting the null hypothesis was zero; as a result, we have a failure to accept the null hypothesis. Overall now it is prominent that, in fact, a disparity does exist and that religious institutions have independent students who get a higher average income. This can mean that in general, students of a religious institution have a higher average independent income, making them of a higher socioeconomic standing when compared to independent individuals who attend non religiously affiliated institutions.